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Abstract — Given a set of matches between a cDNA sequence 
and multiple genomic sequences (fc-dimensional fragments), we 
present tlie first subquadratic chaining algorithm for computing 
an optimal chain of colinear matches, while tolerating overlaps 
between the fragments. The fragments of the resulting chain serve 
as anchors for locating the cDNA in the given genomic sequences 
and for producing a multiple alignment. Moreover, the resulting 
chains identify the common genes among the given sequences 
and in the same time identify the syntenic regions (regions of 
conserved gene-order). Our algorithm is a new addition to the 
family of chaining algorithms permitting overlaps, because the 
previous algorithms are limited to fragments between a cDNA 
sequence and a single genomic sequence. 

I. Introduction 

A fundamental task of every genome annotation project is to 
locate each gene in the genome and to determine its structure. 
This knowledge serves as a basis for elucidating the gene 
function, detecting the regulatory elements, and studying the 
genome organization and evolution. One of the most successful 
methods for accomplishing this task is the mapping of cDNA 
sequences to the genomes they are transcribed from. A cDNA 
sequence is a complementary sequence to an mRNA. (The 
introns are spliced out.) Consequently, a perfect alignment of 
a cDNA sequence to the related genomic sequence locates 
the corresponding gene and directly reveals its exon-intron 
structure; see Figure 1 . The increasing number of full cDNA 
sequencing projects reflects the increasing popularity of this 
method. 

In [11], Shibuya and Kurochkin have presented an efficient 
algorithm to map a cDNA sequence to a single large genomic 
sequence. Their algorithm is based on a strategy that is 
composed of three phases: First, fragments of the type (rare) 
maximal exact matches are computed using the suffix tree 
in linear time and space. Second, a geometry based chaining 
algorithm finds a highest scoring chain of colinear fragments in 
0(771 log m) time and 0(m) space, where m is the number of 
the fragments. The novel part of this chaining algorithm is that 
overlaps between the fragments of the chain are tolerated to 
improve the chain coverage, while the worst case complexity is 
the same as the algorithms tolerating no overlap. (Algorithms 
permitting no overlaps have been presented in [1,6,9, 12].) 

Although the algorithm is relatively complicated due to 
the combination of range maximum queries (RMQs) and the 
candidate list paradigm, it is an important improvement over 
the graph based solution that takes 0{m^) time [11]. Finally, 
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Fig. 1. cDNA mapped to a genomic sequence. 



an alignment is produced by applying a traditional alignment 
algorithm (considering the exon-intron boundaries) on the 
regions between the fragments of the chain. The rationale 
behind permitting overlaps is that (1) overlapping fragments 
were found to be common in cDNA mapping, and usually 
occur at the exon-intron boundaries [11], (2) the chain cov- 
erage will improve, which is crucial for both increasing the 
sensitivity/specificity of the procedure and for speeding-up the 
mapping task. Regarding the sensitivity/specificity, permitting 
no overlapping in the chain will cause many fragments to be 
discarded from the respective chains, which probably leads to 
filtering out many of these chains, despite being significant. 
This is because the scores of these chains drop below the 
user-defined threshold. If the user attempts to overcome this 
drawback and decreases the threshold, many insignificant 
chains will not be filtered out and the specificity will decrease. 
As for the running time, reduced chain coverage results in 
spending much time in the third phase in which an alignment 
on the character level is computed to finish the mapping task. 

With the ever increasing number of sequenced genomes of 
very closely related species or of different strains of the same 
species, it is natural to extend the algorithm of Shibuya and 
Kurochkin to map a cDNA sequence to multiple genomes. 
This provides a method for identifying the common genes 
among the genomes and for identifying the syntenic regions 
(regions of conserved gene-order). 

Generating fragments from multiple genomic sequences 
can be easily achieved in linear time and space, using the 
suffix tree [8] or the enhanced suffix array [2]. Chaining 
fragments from k sequences (one of them being cDNA) 
while permitting overlaps can be generally done in 0{'m?) 
time by a straightforward modification of the graph-based 
approach of [8]. Unfortunately, the quadratic running time 
of this approach is a serious drawback for a large number 
of fragments. To overcome this obstacle, one has to use a 
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geometry-based solution to chain the fragments from k se- 
quences in subquadratic time. Unfortunately, the extension of 
the algorithm of Shibuya and Kurochkin is very complicated, 
if not infeasible, due to the difficulty of analyzing the overlaps, 
according to the suggested objective function, and due to the 
difficulty of combining the corresponding RMQs and candidate 
lists; Shibuya and Kurochkin noticed also these complications 
[11]. Nevertheless, we show in this paper that an efficient 
solution exists, if an objective function specific to the cDNA 
mapping problem is used. We present a chaining algorithm that 
runs in subquadratic time, which is a significant improvement 
over the naive graph-based algorithm. Our algorithm is easy 
to implement, because it uses solely RMQs without candidate 
lists. 

In the following section, the basic concepts and definitions 
used in this paper are presented. Section III introduces the 
chaining problem and discuss a graph based solution. In 
Section IV, we present our geometry based solution. Section 
V contains conclusions and future work. 

II. Basic concepts and definitions 

Let S'[l..r?.] denote a string of length n. S[i..j] denotes 
the substring of S'[l..n] starting at position i and ending at 
position j in S, and S[i] denotes the i character of S. The 
string S in our application represents a DNA sequence (cDNA 
or a genomic sequence). A multiple exact match between k 
sequences Si,.., Sk is a (fc + 1)— tuple {l,pi, ..,pk) such 
that Si[pi..pi + I - 1] = . . . = Sk[pk--Pk + 1-1], i.e., the 
i-character-long substrings of Si,..,Sk starting at positions 
pi, ..,pk, respectively, are identical. A multiple exact match is 
left maximal if Si [pi — 1] =/= Sj [pj — 1] for any 1 < i =/= j < k, 
and right maximal if Si [pi + I] ^ Sj [pj + I] for any I < i =/= 
j < k, i.e., it cannot be extended to the left and to the right 
simultaneously in all the sequences. A maximal multiple exact 
match (multi-MEM) is a left and right maximal multiple exact 
match. 

A multi-MEM {l,pi, ..,Pk) can be represented by a hyper- 
rectangle in M with the two extreme corner points (pi, ..,pk) 
and {pi + I — l,..,pk + I — 1). Such a hyper-rectangle is 
also called a fragment. Figure 3 shows an example of a 
two dimensional fragment, i.e., for A; = 2. In the following, 
we will identify a multi-MEM [l,pi, ..,pk) with its frag- 
ment / in R and denote the two extreme corner points 
by beg{f) = {beg{f).xi, ..,beg{f).Xk) = {pi,..,pk) and 
end{f) = {end{f).xi, ..,end{f).Xk) = (pi+l-l, ..,pk + l- 
1). Furthermore, we define /.length = I to denote the length 
of the multi-MEM corresponding to /. 

For ease of presentation, we consider the point = 
(0,...,0) (the origin) and the terminus t = {\Si\ — 
1, . . . , 15*^1 — 1) as fragments with weight 0. For these frag- 
ments, we define beg{0) = ±, end{0) = 0, beg{t) = t, and 
end{t) = ±, where ± stands for an undefined value. 

multi-MEMs can be readily computed by means of an index 
data structure, such as the suffix tree or the enhanced suffix 
array, using any of the algorithms in [2, 4, 5, 7, 8, 10]. All these 
algorithms take linear time and space. However, they differ in 
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Fig. 2. Overlapping multi-MEMs of two sequences. 



the kind of matches (some compute multi-MEMs from two 
sequences only and some compute multi-MEMs from multiple 
sequences.) and in the programming details through which the 
matches are generated. 

III. Chaining fragments with overlaps 

Definition 3.1: Let [li,pi, ..,pk) and (/2, Q'l, ••, ft) be two 
multi-MEMs with pi < qi, 1 < i < k. We say that 
{li,pi, ..,Pk) overlaps with (^2, i?!, ••, 9fc) in Si if and only 
if qi < Pi + h — I < Qi + h — 1, for any 1 < i < k. 

For fc = 2, Figure 2 (a) shows an example of two multi- 
MEMs overlapping in ^2 but not in 5*1, while Figure 2 (b) 
shows two multi-MEMs overlapping in both 5*1 and S2. 

Definition 3.2: The relation <^ on the set of fragments is 
defined as follows, f '^ f if and only if the following two 
conditions hold: beg{f').Xi < beg{J).Xi and end{f').Xi < 
end{f).Xi, for all 1 < i < fc. If /' < /, then we say that /' 
precedes f. The fragments /' and / are colinear if either /' 
precedes f or f precedes /'. 

Thus, two fragments are colinear if they appear in the 
same order in both sequences. Note that if we further have 
end{f').Xi < beg{f).Xi, for all 1 < i < fc, then /' and / are 
colinear and non-overlapping. 

For two sequences, Shibuya and Kurochkin [11] defined 
the overlap length of two fragments to be the maximum 
of the amount of overlap in 5i and in 6*2. In their paper, 
the score score(C') of a chain C of colinear fragments is 
the sum of the lengths of the fragments in C minus their 
overlap lengths. (Note that there is no gap cost between the 
matches because the large gaps correspond to introns.) As a 
consequence of their definition of the overlap length in the 
computation of a highest-scoring chain Cf ending with /, 
one has to consider four different cases corresponding to the 
regions A, B U Ci, C2, and D shown in Figure 3 (a). Each 
of these cases yields a candidate fragment /,, found either 
by RMQs over the corresponding region or by maintaining 
candidate lists, and a highest-scoring chain Cf^ of colinear 
fragments ending with /i, 1 < « < 4. Then score{C'f) is 
computed by scoreiC j) = niaxi{score(CjJ + f .length — 
overlap length{f,fi)}; see [11] for details. This algorithm 
takes 0(TOlog?7i) time and 0[ni) space, where m is the 
number of fragments. Unfortunately, a direct extension of this 
algorithm to k sequences, one of them being cDNA, is very 
complicated because (1) the suggested penalty for overlaps 
results in complicated non-orthogonal space division, and (2) 
the combination of the RMQs and the candidate list paradigm 
becomes difficult to realize for fc > 2. In the following, 
we overcome these problems by (1) presenting a measure 
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Fig. 3. The fragment / is represented by a 2D rectangle whose 
extreme comer points beg{f) and end{f) are drawn as black circles, 
(a) The four cases that have to be considered in Shibuya and 
Kurochkin's approach, (b) The two cases Au B and CUD that 
have to be considered in our approach. 



of overlaps that considers the fact that one of the sequences 
is cDNA, (2) formulating the objective function as the one 
that maximizes the coverage of the cDNA w.r.t. the genomic 
sequences, and (3) using solely RMQs. 

Definition 3.3: For any two fragments / and /' in k di- 
mensional space, where the k axis corresponds to the cDNA 
sequence, the amount of overlap in the cDNA sequence is 

{end{f').Xk - beg{f).Xk + 1, 
if beg{f).Xk < end{f').Xk < end{f).Xk 
0, otherwise 
Based on this measure of overlap, the cDNA chaining problem 
can be formulated as follows. 

Definition 3.4: Given a set of m fragments, find a chain 
C of colinear fragments /i,/2,..,/t (i.e., h <^ h ^ 
^ ft) such that scoreiC) = y2j=^ fi-length — 
Y.i=i overlapkifi, fi+i) is maximal. 

This objective function maximizes the amount of cDNA 
sequence mapped to the genomic sequence. It is easy to 
see that a perfect mapping has a score that equals the cDNA 
length (percentage coverage being 100%). This function does 
not conflict with aligning the sequences on the character level, 
because overlaps can always be resolved by modifying the 
involved multi-MEMs boundaries and/or performing character 
indels of zero cost in the genomic sequences. Computationally, 
this objective function has the advantage, w.r.t. the space 
devision around a fragment /, that only two regions (defined 
soon) have to be considered, independently of k. As we will 
show later, this means that the chaining problem for fragments 
from k sequences, one of them being cDNA, can be solved 
by only two RMQs in these regions. 

A straightforward solution to the cDNA chaining problem 
is to construct a weighted directed acyclic graph G{V, E), 
where the set of vertices V is the set of fragments (including 
and t), and the set of edges E is characterized as follows. 
For any two nodes v' = f' and v = f, there is an edge 
e[v' —fv)GE with weight of /.length — overlap{f' , f), 
only if /' <^ f. An optimal chain of fragments corresponds 
to a path with maximum score from vertex to vertex t in 
the graph. Because the graph is acyclic, such a path can be 
computed as follows. Let /.score denote the maximum score 
of all chains ending with the fragment /. When we speak 
about the score of the points beg{f) and end(f) we implicitly 
mean the score of /. Clearly, /.score can be computed by the 



recurrence 
/.score = /.length + niax{/ .score — overlapk{/ , /)\/ ^ /} 

A dynamic programming algorithm based on this recurrence 
takes 0(1^^1 + \E\) time provided that computing overlaps 
takes constant time. Because \V\ + \E\ G O(m^), computing 
an optimal chain takes quadratic time. This quadratic running 
time of this graph based approach is a drawback for a large 
number of fragments. In the following section, we present a 
geometry based solution that runs in subquadratic time. 

IV. Geometric based Solution 

Algorithm 4.1 is a geometric solution to this recurrence. The 
algorithm assumes that the k axis corresponds to the cDNA. 
It is based on a sweep-line procedure w.r.t. the first genomic 
sequence and it uses RMQs to find the fragment that maximizes 
the score. There are two data structures Di and D2 to answer 
(2k — 2) -dimensional RMQs with activation. Each fragment / 
is represented in Di and D2 by the (2fc — 2)-dimensional point 
{end{/).xi,..,end{/).Xk,beg{/).X2,..,be9{f)-Xk-i)- For ex- 
ample, for fc = 3 / is represented by the 4-dimensional 
point {end(/).xi, end{/).X2, end{f).X3,beg(/).X2). In 
this algorithm, the function RMQo i[pi--qi], ■■,[pk--<lk], 
b2--92])-->bfc-i--9fc-i])' J ^ {1>2}, is a range maximum 
query that retrieves the fragment of maximum score in the 
set of active fragments/points stored in the data structure Dj. 
By the construction of Di and D2, this fragment has its 
end point within the hyper-rectangular region defined by the 
intervals [pi..qi], ..,[pk..qk] in Si,..,Sk, respectively, and its 
start point within the hyper-rectangle defined by the intervals 
[P2-<l2\,-dPk-i-<lk-i] in S2,..,Sk-i, respectively. As we 
will see soon, the restriction added on the fragment start points 
guarantees that the fragment to be connected to / precedes 
/. As we will discuss later Di and D2 can be efficiently 
implemented either as range trees or kd-trees. 

Algorithm 4.1: 
Sort all start points of the m fragments in ascending order w.r.t. 
their xi coordinate and store them in the array points. 
For each fragment /, create the point {end{/).xi, .., end{/).Xk, 
beg(/).X2, .., beg{/).Xk-i) and store it as inactive in the data 
structures Dj and D2. 
for 1 < i < m 

determine the fragment / with beg{/).xi = points[i] 

(b.xi, .., b.Xk) ■■= {beg{/).xi, .., beg{/).Xk) 

{e.xi, .., e.Xk) := {end{/).xi, .., end{/).Xk) 

qi := RMQoi([0..e.Xi - 1], .., [0..e.Xk-i - 1], [Q.-b.Xk - 1], 
[O..6.X2 -l],..,\Q..b.Xk-i -1]) 

92 := RMQo^dO.-e.Xi - 1], .., [0..e.Xk-i - 1], [b.Xk..e.Xk - 1], 
[Q..b.X2 -l],..,\Q..b.Xk-i -1]) 

determine the fragments /i and /2 corresponding to qi and g2, 

respectively 

if /i = ± then scorei = 

/• ± means no fragment found in the region •/ 

else scorci = /i. score 

if /2 = ± thien score-z = 

/-k 1. means no fragment found in the region -k/ 

else 

itbeg{/2).Xk < beg{/).Xk then 

scores = /-i-score — {end(/2).Xk — beg[/).Xk) 
else /2 = ±, and score2 = 



PROG. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE 2006 ' 



f .score = /.length + max{scorei, score2} 

if scorei > score2 > then connect /i to / 

else if score2 > then connect /2 to / 

activate {e.xj, .., e.Xk, b.X2, .., b.x^^i) in Dj witli score /.score 

activate (e.xi, .., e.Xk, b.X2, .., b.Xk-i) in D2 witli score 

{/.score — end{/).Xk) 

Before proving the correctness of this algorithm, we 
would like to explain it. When beg{f) is scanned, 
we search for a fragment /' that precedes / and 
maximizes /'.score — overlapkif , f). Geometrically, 
we search for the fragment /' such that end{f') lies 
in the region ABCD{f) defined by the hyper-rectangle 
([0..end{f).xi — 1], ..,[0..end{f).Xk — 1]), provided that 
beg{f') lies in the region A{f) defined by the hyper-rectangle 
{[0..beg{f).xi - 1], .., [0..beg{f).Xk - 1]). In Figure 3 (b), the 
region ABCD{f) is the region AuBUCUD, and A{f) is 
A, where k = 2. Note that the notation ABCD{f) (similarly 
for any subregion of it like A{f)) emphasizes the dependence 
of this region on /. To take overlaps into account, we have 
to divide the search region ABCD{f) into two subregions. 
The first subregion is the hyper-rectangle {[0..end{f).xi — 
l],..,[0..end{f).Xk-i - 1],[0. .beg{f).Xk - 1]), denoted by 
AB{f). Any fragment in this region does not overlap with / 
in the cDNA sequence. AB(f) is the region AU B in Figure 
3 (b). In the algorithm, RMQ^ij finds the fragment fi <. f 
of highest score in AB{f), as follows. The first k ranges of 
RMQ^ij restrict the search space to those fragments with end 
points in AB{f), and the last {k-2) ranges of RMQ^i^ add the 
constraint that the start points of these fragments are in A{f). 
Note that we use only (fc — 2) ranges for the constraint that 
the the start points are in A(f), i.e., the ranges [0..beg{f).xi\ 
and [0..beg{f).Xk] are not explicitly included in RMQ^)^. The 
reason is that any fragment /' with beg{f').xi > beg{f).xi 
is not yet activated in Di, and the k argument of RMQ^i^, 
which restricts that end{f2).Xk G [0..beg{f).Xk], implies 
that beg{f2).Xk G \Q..beg{f).Xk\. That is, it is superfluous 
to include these ranges, and unnecessarily increase the 
dimensionality of the RMQ. From the fragments satisfying the 
range constraints, a fragment /i of highest score is retrieved. 
For A: = 3, e.g., RMQ^)^ ([0..end(/).a;i], [0..en(i(/).X2], 
\Q..beg{f).x:i\,[Q..beg{f).X2\) yields the highest scoring 
fragment /i. The first three ranges of RMQ^)^ assure that 
end(fi) G AB{f). The fourth range guarantees that 
beg{fi).X2 G [0..beg{f).X2], which is enough to assure that 
begifi) G Aif). 

The second subregion we examine when scanning 
a fragment / is the hyper-rectangle {[0..end{f).xi — 
l],..,[0..end{f).xk-i - l],[beg{f).Xk..end{f).Xk - 1]), de- 
noted by CD{f). Any fragment ending in this region overlaps 
with / in the cDNA sequence. CD{f) is the region C U D 
in Figure 3 (b), where fc = 2. In order to penalize this 
overlap, we activate each end point in D2 with f ..score — 
end(f).Xk instead of f., score. This guarantees, as we shall 
see in the correctness proof, that a fragment /2 will be found 
such that f2. score — overlapk{f2, f) is maximal in CD{f). 
The (2fc — 2)-dimensional RMQ over D2 finds a fragment 



/2 of highest score in this region such that beg{f2).Xj G 
[0..beg{f2).Xj],2 < j < k - 1. The range [0..beg{f).xi] 
is not explicitly included in RMQ^)^, because any fragment 
/' with beg{f').xi > beg{f).xi is not yet activated in D2. 
The range [0..beg{f).Xk] is also not included, because if 
t'egif2)-Xk > beg{f).Xk, we simply ignore /2. This does not 
affect the correctness of the algorithm, as we shall see in the 
correctness proof, because then there is a fragment in AB{f) 
whose score is at least as high as that of /2. Finally in the 
algorithm, if fi. score > f2.score — overlapk{f, f2), then /i 
is connected to /. Otherwise, /2 is connected to /. 

For a formal correctness proof, we need the following 
definition and lemmata. 

Definition 4.1: The priority of a fragment /, denoted by 
J .priority, is defined as f .priority = f .score — end(f).Xk, 
where the k axis corresponds to the cDNA. 

Lemma 4.1: Let /, /' and /" be three fragments with 
end{f') G CD{f) and end{f") G CD{f). We have 
J" .priority < f .priority if and only if f" .score — 

overlapkif", /) < /'.score - overlapkif , /). 
Proof: 

/" .priority < /' .priority 
<=> /" .score — end(/ ).Xk < /' .score — end{/').Xk 

by adding beg(/).Xk to both sides, we obtain 

/" .score — overlapk{/" , /) < /' .score — overlapk{/' , /) 

Note that in the lemma < can be replaced with <. ■ 

Thus, if /' is a fragment with highest priority in CD{f), 
then /'.score — overlapkij' , f) is maximal in CD{f). The 
priority of a fragment /' is independent of /. Hence, it can be 
computed in constant time when /' is scanned. This has the 
advantage that the overlaps between all fragments need not be 
computed in advance (note that this would yield a quadratic 
time algorithm). 

Lemma 4.2: Let C be a chain composed of the fragments 
fi,..,ft. For every index i, I < i < t — 1, we have 
fijf^i. priority < fi.priority. 
Proof: 

/i+i.priority = /i+i.score - end{/i+i).Xk 

= /i..score + /i+i. length - overlapk(/i, /i+i)- 

end{/i+i).Xk 
= /i. score - beg{/i+i).Xk - overlapk{/i, /i+i) + 1 

< /{.score — end{/i).Xk 

< /i. priority 

Note that if overlapki/i, /i+i) = 0, then /i+i. priority < 
/i .priority. ■ 

Theorem 4.1: Algorithm 4.1 correctly computes an optimal 
chain. 

Proof: When the sweep-line reaches the start point 
beg{f) of fragment /, the end points of all fragments that 
started before beg{f).xi are already activated in the data struc- 
tures Di and D2. The end points of the remaining fragments 
are still inactive. This guarantees that each fragment whose end 
point lies in AB{f) or CD{f) but whose start point occurs 
after beg{f).xi will not be considered in what follows. Be- 
cause each fragment with end point in region AB{f) does not 
overlap with / in the cDNA sequence, the (2fc — 2)-dimensional 
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range query over Di retrieves the fragment of highest score 
in this region and it is guaranteed, as discussed before, that 
beg{f2) G A{f). Analogously, because a fragment /' with end 
point in region CD(f) overlaps with / in the cDNA sequence, 
it is activated in D2 with priority f .score — end(f').Xk. The 
(2k — 2)-dimensional range query over D2 yields the fragment 
/2 of highest priority ending in CD{f) as follows: By Lemma 
4.1, f2.score — overlapk{f2, f) is maximal in CD{f). More- 
over, it is guaranteed, by the sweep-line procedure and by the 
range query, that beg{f2)-Xj < beg{f).Xj,l < j < k — 1. 
However, it is not guaranteed that beg{f2).Xk < beg{f).Xk, 
which violates that beg(f2) G ^(/)- To handle this, we 
proceed by case analysis. Suppose first that beg{f2).Xk < 
beg{f).Xk, i.e., the start point of /2 lies in A(f). In this case 
we connect /2 to /, if f2.score — overlapk{f2, f) > fi-score. 
Otherwise, we connect /i to /, and we are done. Now suppose 
that beg{f2).Xk > beg{f).Xk, i.e., the start point of /2 lies 
in CD{f). This implies that overlapk{f2, f) > f2-length. 
According to Lemma 4.2, if there is a fragment /' connected 
to /2 (i.e., /' is the predecessor of /2 in a highest-scoring 
chain ending with /2), then the end point end{f ) of /' must 
lie in A{f). Hence, overlapkif , /2) = and we have 

/ .score = f .score — overlapk{f , f-i) 

= f2. length + / .score — overlapkif , f-i) — Ji.length 

= f2. score — f2. length 

> /2 .score — overlapk {J2 , /) 

Recall from Lemma 4.1 that f 2. score — overlapk{f2, f) 
is maximal in CD{f). Now it follows from /'.score > 
f2. score — overlapk{f2, f) in conjunction with /'.score < 
f I. score that /2 can be safely ignored. (If /2 has no predeces- 
sor, then /2 can be also safely ignored by Lemma 4.1 because 

f2.score - overlapkif2, f) = f2-length - overlapk{f2, f) < 
0.) ■ 

The complexity of Algorithm 4.1 depends on the com- 
plexity of the RMQs with activation supported by the data 
structures Di and D2. If Di and D2 are implemented as 
range trees supported by the technique of fractional cascading 
and enhanced with priority queues as shown in [1], then the 
complexity of the algorithm is 0(r?T,log ^ mloglogm) time 
and 0{m\og ^ m) space, d = 2fc — 2. If the M-tree is used 
instead of the range tree, then the algorithm takes 0{m'^^d) 
time and 0{m) space. Interestingly, the query time of the kd- 
tree can be improved in practice using a set of programming 
tricks [3]. If the gaps between successive fragments in a chain 
are constrained to be at most W characters long (e.g., W could 
be set to the estimated maximum intron length), then RMQ^i^ 
and RMQ^)^ have to be limited to the regions {[e.xi—W..e.xi — 
1], .., [e.Xk-i -W..e.Xk-i -I], [b.Xk-W..b.Xk-l], [O..6.X2- 
1], .., [0..b.Xk-i - 1]), and {[e.xi - W..e.xi - 1], .., [e.Xk-i - 
W..e.Xk-1-l], [b.Xk..e.Xk-l], [O..6.X2-I], .., [0..6.Xfc_i-l]), 
respectively, where b and e are the start and end points of a 
fragment. 



to implement because it depends on RMQs that can be im- 
plemented either using the range tree or the kd-tree. In a 
further research, we have found that the complexity of the 
algorithm can be improved for two special cases that are 
of practical interest (script in preparation): The first is the 
usage of multi-MUMs and rare multi-MEMs ' . The usage of 
multi-MUMs or rare multi-MEMs will speed-up the whole 
procedure, because repetitions will be filtered out. Moreover, 
there will be no need to use any repeat-masking program; a 
step that takes many hours. The second special case is when 
the amount of overlap between the fragments is constrained 
(i.e., allowed overlap length is at most£— 1 characters, where i 
is the minimum fragment length). This constraint is practically 
relevant, because overlaps are usually short. We have found 
that the time complexity for these two special cases can be 
reduced by nearly log n factor. 
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V. Conclusions 

In this paper, we have presented a subquadratic chaining 
algorithm that permits overlaps between the fragments, which 
are matches of the type multi-MEMs. Our algorithm is easy 



'a multi-MEM (l,pi, ..,piS] is called rare of repeat-value r, if in each 
Si the substring Si\pi..pi + J — 1] occurs at most r times, 1 < i <. k. A 
maximal multiple unique match (multi-MUM) is a rare multi-MEM such that 
r = 1, i.e., Silpi..pi -j- 1 — 1] occurs exactly once in each S;. 
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